
[ORT 1.17.0 Release] Cherry pick 1st round #19243

Merged
merged 23 commits into rel-1.17.0 from yguo/cherry-pick-1st-round on Jan 27, 2024

Conversation

YUNQIUGUO
Contributor

Description

[ORT 1.17.0 Release] Cherry pick 1st round

PR authors, please take a look and let me know if there are any questions about the changes, or approve accordingly.

Motivation and Context

wejoncy and others added 17 commits January 23, 2024 12:17
…19182)

### Description
Extends the code coverage to the Entropy, Histogram and Distribution
calibration methods, fixing bugs found while doing it.



### Motivation and Context
Bugs detected in [Olive](https://github.com/microsoft/OLive).
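
Not part of the change itself, but a minimal sketch of how these calibration methods are exercised through the public quantization API; the model path, input name, and random calibration data are placeholders, and the exact `CalibrationMethod` members should be checked against this release.

```
# Hypothetical sketch: run static quantization with the Entropy and
# Distribution calibration methods touched by this change.
# "model.onnx", the input name, and the random data are placeholders.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    CalibrationMethod,
    QuantType,
    quantize_static,
)


class RandomDataReader(CalibrationDataReader):
    """Feeds a few random batches to the calibrator (placeholder data)."""

    def __init__(self, input_name="input", shape=(1, 3, 224, 224), batches=8):
        self._data = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)} for _ in range(batches)]
        )

    def get_next(self):
        return next(self._data, None)


for method in (CalibrationMethod.Entropy, CalibrationMethod.Distribution):
    quantize_static(
        "model.onnx",                         # placeholder input model
        f"model_{method.name.lower()}.onnx",  # placeholder output path
        RandomDataReader(),
        calibrate_method=method,
        activation_type=QuantType.QUInt8,
        weight_type=QuantType.QInt8,
    )
```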
### Description

Allow the proxy to load models with 1 GB <= size < 2 GB.

Resolves #19157.
…rted (#19179)

### Description
Show a warning when numThreads is set but multi-threading is not supported.
Resolves #19148, #18933.

For web: when crossOriginIsolated is false.
For Node.js: always disabled.
…ry (#19174)

### Description
Check the ep_cache_context node attribute of the EPContext node, and don't
allow relative paths like "../file_path".
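
Not the EP's actual code, just a generic illustration of the kind of check described above: rejecting an `ep_cache_context` value that escapes the context model's directory. The helper name and paths are made up for the example.

```
# Generic illustration (hypothetical helper, not EP code): only accept
# ep_cache_context values that stay inside the context model directory.
import os


def is_within_directory(base_dir: str, candidate: str) -> bool:
    """Return True only if `candidate` resolves to a path inside `base_dir`."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base_dir, candidate))
    return os.path.commonpath([base, target]) == base


print(is_within_directory("./context_model_dir", "engine_dir/model.bin"))  # True
print(is_within_directory("./context_model_dir", "../file_path"))          # False
```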
### Description
1. Make JBLAS codes an external module of ORT.
2. Move q4 gemm code to contrib_ops.
3. Update template kernel library to v0.1 release.


### Motivation and Context
We found that the current LLM performance is far below our
expectations. Here is some performance data collected on the Mistral-7B
model with a Xeon 8480:

| 8 threads | prompt length=32, past_len=32 | prompt length=1, past_len=32 |
| -- | -- | -- |
| ORT-main | 1220 ms | 263 ms |
| Neural-speed | 564 ms | 87 ms |
| ORT-this PR | 597 ms | 120 ms |

Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code,
there is a 33 ms (87 ms vs. 120 ms) latency gap between the two frameworks.
Profiling statistics show that the total latency of `MatMulNBits` is 86.7 ms,
while the total latency of all int4 GEMMs in `Neural-speed` is 84.8 ms. So
other ops introduce roughly an extra 30 ms of latency.

The performance of MatMulNBits in this PR meets our expectations.

### Remaining Issues
1. For hybrid CPUs, like the Core 12900K, the ONNX Runtime thread pool uses
TaskGranularityFactor to scale its number of threads. This is not expected
in our code design and may slow down hybrid CPU performance by 30~40%.
2. Prepack uses a single thread, which makes session initialization very slow.
3. MatMulNBits with zero points falls back to COMP_FP32 even when
accuracy_level=4. Our COMP_INT8 IGemmCore path with zero points is not
optimized yet; it will be updated in the future. So, for an int4 model with
zero points, there is no difference between accuracy_level 0 and 4 (see the
sketch below).
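
For context, a hypothetical sketch of producing an int4 `MatMulNBits` model with an `accuracy_level`, assuming the `MatMul4BitsQuantizer` tool and its `block_size`, `is_symmetric`, and `accuracy_level` parameters are available in this release; paths are placeholders.

```
# Hypothetical sketch (verify MatMul4BitsQuantizer and its parameters in
# this release): quantize MatMul weights to int4 with an accuracy level.
import onnx
from onnxruntime.quantization import matmul_4bits_quantizer

model = onnx.load("model_fp32.onnx")       # placeholder model path
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model,
    block_size=32,
    is_symmetric=False,   # False keeps zero points, the case discussed in issue 3
    accuracy_level=4,     # request the COMP_INT8 path where it is supported
)
quant.process()
quant.model.save_model_to_file("model_int4.onnx", use_external_data_format=True)
```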
### Description
Upgrade package versions.

```
# npm audit report

electron  23.0.0-alpha.1 - 23.3.13
Severity: moderate
ASAR Integrity bypass via filetype confusion in electron - GHSA-7m48-wc93-9g85
fix available via `npm audit fix --force`
Will install [email protected], which is a breaking change
node_modules/electron

get-func-name  <2.0.1
Severity: high
Chaijs/get-func-name vulnerable to ReDoS - GHSA-4q6p-r6v2-jvc5
fix available via `npm audit fix`
node_modules/get-func-name

semver  <=5.7.1 || 6.0.0 - 6.3.0 || 7.0.0 - 7.5.1
Severity: moderate
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
fix available via `npm audit fix`
node_modules/cross-spawn/node_modules/semver
node_modules/global-agent/node_modules/semver
node_modules/semver
```
### Description
This PR updates the LLaMA-2 attention fusions by adding the following:

- Loading the PyTorch model from Hugging Face with the `LlamaAttention`
class before exporting
- Updating the attention mask pattern matching to support another case

This PR also fixes [this
issue](#19040).

### Motivation and Context
Recent changes to Hugging Face's `transformers` library break the
existing pattern matching. Since the attention fusions aim to change the
graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to
`LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, ultimately
it does not matter what nodes comprise the `Set of Attention Nodes`
because they will all be removed and replaced by the `Attention Op` in
the end.

Therefore, it does not matter whether the `LlamaAttention` class or a
different attention class is used to load the PyTorch model before
exporting because the expected graphs after the attention fusions will
look identical no matter the attention class chosen. By loading the
PyTorch model with the `LlamaAttention` class instead of other attention
classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then
exporting it to ONNX, the existing pattern matching will continue to
work.
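
A minimal sketch of the loading step described above, assuming a recent `transformers` version that accepts `attn_implementation`; the checkpoint name and dtype are placeholders.

```
# Hypothetical sketch: load the Hugging Face model with the eager
# LlamaAttention implementation (not FlashAttention2/SDPA) before export,
# so the exported graph matches the existing fusion patterns.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",       # placeholder checkpoint
    torch_dtype=torch.float32,        # placeholder dtype
    attn_implementation="eager",      # selects the LlamaAttention class
)
model.eval()
# ... export to ONNX and run the ORT transformers attention fusion afterwards.
```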
… is not guaranteed (#19195)

Fix the issue that the generated context cache model's input/output order is not guaranteed.

### Description
Currently, QNN EP generates the context cache model in the Compile() method, which only has access to the partitioned graph, and the input/output order of the partitioned graph is not guaranteed. The EP also does not have a view of the user's input model. The context cache model generation therefore has to move to a higher level, in GraphPartitioner, which has a view of the partitioned model.
This is also a breakdown of the PR for multi-partition support: #18865.
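
A minimal sketch of generating a context cache model from the Python API, assuming the `ep.context_*` session config keys listed later in this PR apply here; the QNN backend library name and model path are placeholders.

```
# Hypothetical sketch: ask QNN EP to dump a context cache model.
# Config keys are taken from the EP context options listed below;
# "model.onnx" and "QnnHtp.dll" are placeholders.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "./model_ctx.onnx")
so.add_session_config_entry("ep.context_embed_mode", "1")

session = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})],
)
```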
…der options (#19154)

Several changes:

1. To align with other EPs' setting of EP context configs in session options, for example [QNN EP](#18877), EP context configs for the TRT EP can be configured through (see the sketch after this list):
   1. Session options: `ep.context_enable`, `ep.context_file_path` and `ep.context_embed_mode`
   2. Provider options: `trt_dump_ep_context_model`, `trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode`
   3. The settings above have a 1:1 mapping, and provider options take priority over session options.
    
```
    Please note that there are rules for using the following context model related provider options:

     1. In the case of dumping the context model and loading the context model,
        for security reasons, TRT EP doesn't allow the "ep_cache_context" node attribute of the EP context node to be
        an absolute path or a relative path that is outside of the context model directory.
        This means the engine cache needs to be in the same directory as, or a sub-directory of, the context model.

     2. In the case of dumping the context model, the engine cache path will be changed to a path relative to the context model directory.
        For example:
        If "trt_dump_ep_context_model" is enabled and "trt_engine_cache_enable" is enabled,
           and "trt_ep_context_file_path" is "./context_model_dir",
           - if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
           - if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
```

2. Users can decide the name of the dumped "EP context" model via
`trt_ep_context_file_path`; see GetCtxModelPath() for more details.

3. Added suggested comments from
#18217
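
As referenced above, a minimal sketch (not from the PR) of passing the TRT EP context provider options through the Python API; the model path, cache paths, and option values are placeholders.

```
# Hypothetical sketch: dump an "EP context" model via TRT EP provider options.
# Paths and values are placeholders; option names come from the list above.
import onnxruntime as ort

trt_options = {
    "trt_engine_cache_enable": "1",
    "trt_engine_cache_path": "engine_dir",           # relative to the context model dir
    "trt_dump_ep_context_model": "1",
    "trt_ep_context_file_path": "./context_model_dir",
    "trt_dump_ep_context_embed_mode": "0",           # embed mode, see ep.context_embed_mode above
}

session = ort.InferenceSession(
    "model.onnx",                                    # placeholder model
    providers=[("TensorrtExecutionProvider", trt_options)],
)
```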
### Description
1. Support causal mask in MHA CPU.
2. Support custom rotary_dim in rotary_emb.
3. Add bf16 support for rotary_emb.
4. Fix a bug in the attention rotary embedding.


### Description
- Adds the following session options to configure the device:
  - `soc_model`: The SoC model number. Refer to the QNN SDK documentation for valid values. Defaults to "0" (unknown).
  - `htp_arch`: The minimum HTP architecture the driver will use to select compatible QNN operators.
  - `device_id`: The ID of the device to use when setting `htp_arch`. Defaults to "0" (for single device).

### Motivation and Context
Allow more configuration.
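
A minimal sketch, assuming these options are passed to the QNN EP through the Python API; the backend library name, model path, and values are placeholders.

```
# Hypothetical sketch: configure the QNN device options named above.
# "QnnHtp.dll", "model.onnx", and the htp_arch value are placeholders.
import onnxruntime as ort

qnn_options = {
    "backend_path": "QnnHtp.dll",   # assumed HTP backend library on Windows
    "soc_model": "0",               # SoC model number, "0" = unknown (default)
    "htp_arch": "73",               # placeholder minimum HTP architecture
    "device_id": "0",               # device to use when htp_arch is set
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", qnn_options)],
)
```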
…oat16 (#17031)

### Description
This PR adds SbgemmKernel for aarch64. This includes the Sbgemm kernel that
implements matrix multiplication with bfloat16 SIMD instructions (bfmmla),
plus MatMul operator changes to invoke the Sbgemm kernel. To enable the
Sbgemm kernel, set the session option `kOrtSessionOptionsGemmFastMathMode`
(see the sketch below).

The PR also adds new test cases for mlas and ORT.
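
A minimal sketch of enabling this from Python, assuming `kOrtSessionOptionsGemmFastMathMode` corresponds to the `mlas.enable_gemm_fastmath_arm64_bfloat16` session config key; verify the exact key string in onnxruntime_session_options_config_keys.h.

```
# Hypothetical sketch: the config key string below is an assumption derived
# from kOrtSessionOptionsGemmFastMathMode; check the header for the real key.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

session = ort.InferenceSession("model.onnx", sess_options=so)  # placeholder model
```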

### Motivation and Context

This is to improve MatMul performance on the aarch64 platform.
I have run the benchmarking script below (bert, roberta and gpt2 model
inference) on an AWS Graviton3-based c7g.4xl instance and observed a
1.2x-1.76x performance improvement compared to the sgemm (fp32) kernel
performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
The unit test precision results match the sgemm kernel results.

Build command used:
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync`
### Description
Adds a job to create a nightly python package for ORT/QNN on Windows
ARM64.
Must build onnxruntime-qnn with python 3.11 and numpy 1.25.

**Note: pipeline run may take up to 3 hrs**

### Motivation and Context
Make it possible to get a nightly python package with the latest updates
to QNN EP.
Issue #19161
### Description
Update unet fusion for the [stable diffusion webui extension](https://github.com/tianleiwu/Stable-Diffusion-WebUI-OnnxRuntime):
(1) Update the fusion pattern to support fp16 unet models.
(2) Add a progress bar.
(3) Use a cached map to speed up dtype and shape lookups in the shape inference result.
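
A hypothetical sketch (not from the PR) of running the UNet fusions through the ORT transformers optimizer; the paths and options are placeholders.

```
# Hypothetical sketch: optimize an exported UNet model with the unet fusions.
# Paths and options are placeholders.
from onnxruntime.transformers.optimizer import optimize_model

opt = optimize_model(
    "unet/model.onnx",      # placeholder UNet export (fp32 or fp16)
    model_type="unet",
    opt_level=0,
    use_gpu=True,
)
opt.save_model_to_file("unet/model_optimized.onnx", use_external_data_format=True)
```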

### Description
Add BuildArch

To verify:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=400952&view=logs&j=5b022bb4-70a7-5401-8766-a8a7802c7150&t=291e85c7-5547-590b-50de-4e01fcd4eba3&l=14

@YUNQIUGUO YUNQIUGUO dismissed stale reviews from kunal-vaishnavi and yufenglee via 8bdea53 January 24, 2024 05:24
wangyems previously approved these changes Jan 24, 2024
xadupre previously approved these changes Jan 24, 2024
tianleiwu previously approved these changes Jan 24, 2024
@snnn
Member

snnn commented Jan 24, 2024

I suggest holding off on it for a moment, since this PR will add a new dependency, neural-speed, to ONNX Runtime. I have some concerns about it. I've just sent an email to the author who added this component, and another email to a few PMs to discuss it. I do not have any objection to adding the dependency, but there are some details that need to be figured out. Please give me a few days to complete the work.

@YUNQIUGUO
Contributor Author

> I suggest holding off on it for a moment, since this PR will add a new dependency, neural-speed, to ONNX Runtime. I have some concerns about it. I've just sent an email to the author who added this component, and another email to a few PMs to discuss it. I do not have any objection to adding the dependency, but there are some details that need to be figured out. Please give me a few days to complete the work.

ok, noticed.

### Description
Remove old Python files.


### Motivation and Context
We have a new op MatMulNBits and this one is deprecated.
@YUNQIUGUO YUNQIUGUO dismissed stale reviews from tianleiwu, xadupre, and wangyems via 0714f0c January 24, 2024 23:57
chilo-ms and others added 4 commits January 24, 2024 17:02
…age (#19251)

C.register_tensorrt_plugins_as_custom_ops() is only available in the GPU
Python package. Add a condition to avoid calling it in the CPU Python package.

TRT EP's GetTensorRTCustomOpDomainList() creates a vector of
OrtCustomOpDomain objects and releases ownership of those objects,
but those objects are never released. At the session level, we need to make
the TRT EP remember which OrtCustomOpDomain objects it created and release
them at EP destruction time.
### Description
Since Cutlass can be built with CUDA 11.4 (the minimum CUDA version for the
onnxruntime CUDA build), there is no need to have a flag to disable cutlass.

Changes:
(1) Reverted #18761
(2) Remove the condition to build cutlass.
(3) Fix a few build errors and warnings found while testing the CUDA 11.4 build.

Note that SM 89 and 90 (including fp8) require CUDA 11.8 or later.
Flash attention and cutlass fused multihead attention will not be built
for CUDA < 11.6. It is recommended to build with CUDA 11.8 or above if
you want to support the latest GPUs.

It is better to include this in 1.17.0 (otherwise, the release branch
might encounter build failures with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used in Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release  --build_shared_lib --parallel  --use_cuda --cuda_version 11.4 \
            --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --disable_types float8
```

### Description
Update abseil to a release tag and register neural_speed to CG.


### Motivation and Context
Currently we are using a non-released version of abseil. Using a tag is better.
@YUNQIUGUO YUNQIUGUO requested a review from a team as a code owner January 26, 2024 20:21
@YUNQIUGUO YUNQIUGUO requested review from snnn and yufenglee January 26, 2024 20:22
@YUNQIUGUO YUNQIUGUO merged commit 3fd94a8 into rel-1.17.0 Jan 27, 2024
108 of 111 checks passed
@YUNQIUGUO YUNQIUGUO deleted the yguo/cherry-pick-1st-round branch January 27, 2024 04:11
YUNQIUGUO added a commit that referenced this pull request Feb 1, 2024
### Description

[ORT 1.17.0 Release] Cherry pick 1st round

PR authors, please take a look and let me know if there are any
questions about the changes, or approve accordingly.


---------

Co-authored-by: wejoncy <[email protected]>
Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: Hector Li <[email protected]>
Co-authored-by: luoyu-intel <[email protected]>
Co-authored-by: kunal-vaishnavi <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Co-authored-by: Ye Wang <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: snadampal <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Heflin Stephen Raj <[email protected]>
Co-authored-by: Yifan Li <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Changming Sun <[email protected]>